Abstract. Content Analysis System (CoAnSys) is a research framework
for mining scientific publications using Apache Hadoop. This article de- 
scribes the algorithms currently implemented in CoAnSys including clas- 
sification, categorization and citation matching of scientific publications. 
The size of the input data classifies these algorithms in the range of big 
data problems, which can be efficiently solved on Hadoop clusters. 
1 Introduction

The growing amount of data is one of the biggest challenges in both commercial
and scientific applications [1]. The general intuition is that properly harnessed
information may give additional insight into phenomena occurring in the data.
To meet this expectation, Google proposed the MapReduce paradigm, whose
open-source implementation is Apache Hadoop. The Apache Hadoop ecosystem
provides a way to use hardware resources efficiently and to describe data
manipulations conveniently. At the Centre for Open Science (CeON) we adopted
this solution and produced the Content Analysis System (CoAnSys), a framework
for finer-grained mining of scientific publications. CoAnSys enables data
engineers to easily implement data mining algorithms and chain data
transformations into workflows. During the development of CoAnSys, a set of
good implementation practices and techniques has emerged [2]. In this article we
share practical knowledge in the field of big data implementations, based on
three use cases: citation matching, document similarity and document
classification.

The rest of this paper is organized as follows. Section 2 presents an overview
of CoAnSys. Section 3 describes algorithms developed at CeON that are well
suited to the MapReduce paradigm. Section 4 contains conclusions and future plans.



2 CoAnSys 



The main goal of CoAnSys is to provide a framework for processing a large 
amount of text data. Currently implemented algorithms allow for knowledge ex- 
traction from scientific publications. Similar software systems include Behemoth 
0, UIMA [3], Synat @], OpenAIRE [5] and currently developed OpenAIREplus 
[5]. The difference between CoAnSys and the aforementioned tools lies in the
implementation of algorithms. CoAnSys is used to conduct research in text mining
and machine learning; all methods implemented in the framework have already
been published or will be published in the future. An architecture overview of
CoAnSys is illustrated in Fig. 1.




Fig. 1. A generic architecture of CoAnSys. 



While designing the framework, we paid close attention to the input/output
interfaces. For this purpose CoAnSys employs Protocol Buffers, a widely used
method of serializing data into a compact binary format. Serialized data is then
imported into HBase using the REST protocol. This allows for simultaneous
import of data from multiple clients. On the other hand, querying a large number
of records from HBase is slower than performing the same operation on a sequence
file stored in HDFS. Therefore, in the input phase of a CoAnSys workflow, the
data is copied from an HBase table to an HDFS sequence file, and this format
is recognized as valid input for the algorithms.

Six modules currently implemented in CoAnSys are illustrated in the Algorithms
box in Fig. 1. Each module performs a series of MapReduce jobs that are
implemented in Java, Pig or Scala. Apache Oozie is used as a workflow scheduler
system that chains modules together. Each module has well-defined I/O
interfaces in the form of Protocol Buffers schemas. This means that sequence
files are also used as a communication layer between modules. The output data
from each workflow is first stored as an HDFS partial result (a sequence file
containing records serialized with Protocol Buffers) and then exported to the
output HBase table, where it can be accessed via REST.

Even though CoAnSys is still under active development, there are at
least three ongoing projects that will utilize parts of the CoAnSys framework.
POL-on is an information system about higher education in Poland. SYNAT is a
Polish national strategic research program to build an interdisciplinary system
for interactive scientific information. OpenAIREplus is the European open
access data infrastructure for scholarly and scientific communication.

3 Well-suited Algorithms 

In this section a few examples of MapReduce-friendly algorithms are presented.
The MapReduce paradigm imposes a set of constraints that not all algorithms
can satisfy. From the very beginning, the main effort in CoAnSys has
been put into document analysis algorithms, i.e. author name disambiguation
[7], [8], [9], metadata extraction [10], document similarity and classification
calculations [11], [12], citation matching [13], [14], etc. Some algorithms can
be used in the Hadoop environment out of the box, some need further amendments
and some are entirely not applicable [15].

For the sake of clarity, the description of the algorithms focuses on the imple- 
mentation techniques (such as performance improvements), while the enhance- 
ments intended to elevate accuracy and precision are omitted. 

3.1 Citation Matching - General Description 

A task almost always performed when researching scientific publications is
citation resolution. It aims to match citation strings against the documents
they reference. As it consists of many similar and independent subtasks, it can
greatly benefit from the use of the MapReduce paradigm. Citation matching can be
described by the following steps, illustrated in Fig. 2:

1. Retrieve documents from the store. 

2. Map each document into its references (i.e. extract reference strings). 

3. Map each reference to the corresponding document (i.e. the actual match- 
ing). 

(a) Heuristically select best matching documents (map step). 

(b) Among them return the one with the biggest similarity (but not smaller 
than a given threshold) to the reference (reduce step). 

4. Persist the results. 
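
The steps above can be sketched in a few lines of plain Python. This is only an illustration of how the map and reduce stages fit together, not the CoAnSys implementation: the in-memory `DOCUMENTS` store, the word-overlap heuristic, the `difflib`-based similarity and the threshold value are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical document store: id -> (title, list of raw reference strings).
DOCUMENTS = {
    "d1": ("Mining scientific publications with Hadoop",
           ["Smith J. Approximate string matching. 2001."]),
    "d2": ("Approximate string matching", []),
    "d3": ("Approximate graph matching", []),
}


def extract_references(doc_id, doc):
    """Step 2 (map): one document -> many (citing id, reference string) pairs."""
    _, references = doc
    for ref in references:
        yield doc_id, ref


def heuristic_candidates(citing_id, ref):
    """Step 3a (map): one reference -> many candidate documents.

    Assumed heuristic: the reference shares a word longer than 3 letters
    with the candidate's title.
    """
    ref_words = {w.strip(".,").lower() for w in ref.split()}
    for cand_id, (title, _) in DOCUMENTS.items():
        if cand_id == citing_id:
            continue
        title_words = {w.lower() for w in title.split() if len(w) > 3}
        if ref_words & title_words:
            yield (citing_id, ref), cand_id


def best_match(key, candidate_ids, threshold=0.5):
    """Step 3b (reduce): many candidates -> at most one match above threshold."""
    citing_id, ref = key
    scored = [
        (SequenceMatcher(None, ref.lower(), DOCUMENTS[c][0].lower()).ratio(), c)
        for c in candidate_ids
    ]
    score, best = max(scored)
    return (citing_id, ref, best) if score >= threshold else None


def match_citations():
    """Steps 1-4 chained together; grouping by key stands in for the shuffle."""
    grouped = defaultdict(list)
    for doc_id, doc in DOCUMENTS.items():
        for citing_id, ref in extract_references(doc_id, doc):
            for key, cand in heuristic_candidates(citing_id, ref):
                grouped[key].append(cand)
    results = (best_match(key, cands) for key, cands in grouped.items())
    return [r for r in results if r is not None]
```

Calling `match_citations()` on the toy store resolves the single reference of `d1`, for which both `d2` and `d3` are heuristic candidates and the reduce step keeps the more similar one.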



5 http://polon.nauka.gov.pl

6 http://www.synat.pl/

7 http://www.openaire.eu/ 



[Figure 2. Data: documents, references, candidate documents, best matching pairs. Steps: document reading, reference string extraction, heuristic matching, selecting the best matching, storing resolved references.]



Fig. 2. Citation matching steps. At first, the documents from appropriate Sequence- 
File are read and their metadata is extracted. Then, in the first step of citation match- 
ing, a heuristic is used to find documents that may match each citation. In the next 
step, the best match for each citation is selected. Finally, the results are persisted in a 
SequenceFile. Note that steps that transform one entry into many can be implemented 
as mapping and those transforming many into one as reduction. 



3.2 Citation Matching - Implementation Details

Index. Heuristic matching is done using an approximate author index that allows
retrieval of elements with edit distance [16] less than or equal to 1. We have
managed to design it to fit the MapReduce paradigm, and the Hadoop environment
in particular. It implements the ideas presented by Manning et al. in Chapter 3
of [17].
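
To make the idea of an approximate index concrete, here is a minimal in-memory sketch. It uses a deletion-neighbourhood scheme (each name is indexed under itself and under every variant with one character removed), followed by an explicit distance check. This is one simple way to obtain edit-distance-1 lookups and is not necessarily the structure used in CoAnSys; the class and function names are hypothetical.

```python
from collections import defaultdict


def variants(name):
    """The name itself plus every string obtained by deleting one character."""
    return {name} | {name[:i] + name[i + 1:] for i in range(len(name))}


def within_one_edit(a, b):
    """True if the Levenshtein distance between a and b is at most 1."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # exactly one substitution
    return a[i:] == b[i + 1:]          # exactly one insertion into `a`


class ApproximateAuthorIndex:
    """Maps author names to document ids, tolerating one edit in the query."""

    def __init__(self, entries):
        self._buckets = defaultdict(set)
        for author, doc_id in entries:
            author = author.lower()
            for v in variants(author):
                self._buckets[v].add((author, doc_id))

    def lookup(self, query):
        query = query.lower()
        hits = set()
        for v in variants(query):
            for author, doc_id in self._buckets.get(v, ()):
                # Sharing a variant is necessary but not sufficient
                # (e.g. "ab" and "ba"), so the distance is verified.
                if within_one_edit(query, author):
                    hits.add(doc_id)
        return hits
```

For example, an index built from `[("Kowalski", "d1"), ("Nowak", "d2")]` returns `{"d1"}` for the misspelled query `"Kowalsky"`.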

To store the index, we needed a data structure that would enable fast retrieval
as well as scanning of sorted entities. The Hadoop MapFile turned out to be a
good solution. It extends the capabilities of a SequenceFile (the basic way of
storing key-value pairs in HDFS) by adding an index to the data stored in it. The
elements of a MapFile can be retrieved quickly, as the index is usually small
enough to fit in memory. The data is required to be sorted, which makes
changing a MapFile laborious, yet this is not a problem since our indices are
created from scratch for each algorithm execution. Hadoop exposes an API for
MapFile manipulation which provides operations such as sorting, entity retrieval
and data scanning.
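
The principle behind such a structure (sorted records plus a small index over a subset of keys) can be illustrated with a toy in-memory analogue. The sketch below is not the Hadoop MapFile API; the class name, the `interval` parameter and the use of Python lists in place of files are assumptions made for illustration, and keys are assumed to be unique.

```python
import bisect


class SortedStoreWithIndex:
    """Sorted (key, value) records plus a sparse index of every n-th key."""

    def __init__(self, records, interval=2):
        # Data must be sorted by key; this is why updates are laborious.
        self._records = sorted(records, key=lambda r: r[0])
        self._interval = interval
        # The sparse index is small enough to be kept in memory.
        self._index_keys = [k for k, _ in self._records[::interval]]

    def get(self, key):
        """Find the block via the index, then scan only that block."""
        block = bisect.bisect_right(self._index_keys, key) - 1
        if block < 0:
            return None
        start = block * self._interval
        for k, v in self._records[start:start + self._interval]:
            if k == key:
                return v
        return None

    def scan(self, start_key):
        """Return all records with key >= start_key, in key order."""
        keys = [k for k, _ in self._records]
        return self._records[bisect.bisect_left(keys, start_key):]
```

A lookup touches the in-memory index and a single block of records, while `scan` exploits the sort order to return a contiguous range.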

Distributed Cache. Every worker node needs to access the index (the whole
index, not just a part of the MapFile). It therefore seemed to be a good idea to
store it in HDFS. Unfortunately, this approach has serious performance issues,
because the index is queried very often and the speed of the network connection
is the main bottleneck. While looking for a better solution, we turned to
the Hadoop Distributed Cache. It allows data to be distributed among worker
nodes so that it can be accessed locally. The achieved performance boost was
enormous: citation matching on a sample of 2000 documents ran four
times faster.



Scala and Scoobi. As MapReduce originates in the functional programming
paradigm, one might suppose it would fit the Scala language well. Indeed, during
the citation matching implementation, we used the Scoobi library, which
enables easy Hadoop programming in Scala by providing an API similar to Scala's
native collections. This way, very clean code can be written. Despite the great
reduction of boilerplate, Scoobi does not restrict access to some low-level
Hadoop features. When one desires complete control over job execution, though,
the default Hadoop API may need to be used.

Task Merging. Fig. 2 shows subsequent map steps (which could be implemented
as MapReduce jobs with zero reducers). Sometimes it might be beneficial to
merge such tasks, as efficiency may be improved by avoiding intermediate
data storage and additional initialization cost. That is what Scoobi tends to do
when computing an execution plan. However, merging reduces parallelism, which,
in turn, can negatively impact performance.

For instance, suppose we want to process two documents, the first containing
one citation and the second containing fifty. In addition, let us assume that we
are using a cluster of two nodes. If the citation extraction and heuristic
matching steps are merged, then the first mapper would extract and match one
citation and the second would have to process fifty of them. On the other hand,
if the tasks remain independent, load balancing will occur after citation
extraction and before the actual matching. As a result, the citation matching
workload will be more evenly distributed. Unfortunately, Scoobi does not allow
task merging to be prevented, and eventually this part of the process was
implemented using the low-level Hadoop API.
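
The effect in this example can be checked with a short calculation. The sketch below is an idealized model, assuming that every citation costs the same to match and that, when the steps stay separate, citations are redistributed evenly between nodes; the function names are hypothetical.

```python
def merged_loads(citations_per_doc, nodes):
    """Merged steps: whole documents are assigned to nodes round-robin,
    so each node matches all citations of the documents it received."""
    loads = [0] * nodes
    for i, n_citations in enumerate(citations_per_doc):
        loads[i % nodes] += n_citations
    return loads


def separate_loads(citations_per_doc, nodes):
    """Separate steps: extracted citations are rebalanced across nodes
    before the matching step starts."""
    base, extra = divmod(sum(citations_per_doc), nodes)
    return [base + (1 if i < extra else 0) for i in range(nodes)]


print(merged_loads([1, 50], nodes=2))    # [1, 50]
print(separate_loads([1, 50], nodes=2))  # [26, 25]
```

With merged tasks the slower node handles 50 citations, whereas with separate tasks the busiest node handles 26, so the matching phase finishes in roughly half the time under these assumptions.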

3.3 Document Similarity - General Description 

A good illustration of a problem well suited to MapReduce is the computation
of document similarity in a large collection of documents, assuming that the
similarity between two documents is expressed as the similarity between the
weights of their common terms. Such an approach divides the computation into two
consecutive steps:

1. the calculation of weights of terms for each document 

2. the invocation of a given similarity function on weights of terms related to 
each pair of documents 

In our current implementation, the term frequency-inverse document frequency
(TFIDF) measure and the cosine similarity are used to produce weights for
terms and to calculate their similarity, respectively. The process is briefly
depicted in Fig. 3.

8 http://nicta.github.com/scoobi/
Fig. 3. Document similarity steps. At first, each document is split into terms
and the importance of each term to a document is calculated using TFIDF 
measure (resulting in the vector of weights of terms for each document). 
Then, documents are grouped together into pairs, and for each pair, the 
similarity is calculated based on the vectors of weights of common terms 
associated with the documents. 



Term Weighting Algorithm. TFIDF is a well-known information retrieval
algorithm that measures how important a word is to a document in a collection
of documents. Generally speaking, a word becomes more important to a
document if it appears frequently in this document but rarely in other documents.
Formally, TFIDF can be described by Eqs. 1-3:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i \tag{1}$$

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \tag{2}$$

$$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \tag{3}$$

where $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$,
$D$ is the corpus of documents, and $\{d : t_i \in d\}$ is the set of documents
containing term $t_i$.

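The way to use TFIDF in the document similarity calculation is presented in
Eq. 4:

$$\mathrm{cosineDocSim}(d_x, d_y) = \frac{\sum_i \mathrm{tfidf}_{i,x} \cdot \mathrm{tfidf}_{i,y}}{\sqrt{\sum_i \mathrm{tfidf}_{i,x}^2} \cdot \sqrt{\sum_i \mathrm{tfidf}_{i,y}^2}}, \quad \text{for } x < y < |D| \tag{4}$$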



For each document $d$, this algorithm produces the vector $W_d$ of term weights
$w_{t,d}$, which indicate the importance of each term $t$ to the document. Since
it consists of separate aggregation and multiplication steps, it can be nicely
expressed in the MapReduce model with several map and reduce phases [18], [19],
[20]. In addition, several popular techniques that increase the efficiency and
performance of the algorithm have been deployed:

1. stop words filtering (based on a predefined stop-word list)

2. stemming (the Porter stemming algorithm [21]) 

3. applying n-grams to extract phrases (considering a statistically frequent n- 
gram as a phrase) 

4. removal of the terms with the highest frequencies in a corpus (automatically, 
but with a parametrized threshold) 

5. weights tuning (based on the sections where a given term appears). 
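
As a rough illustration of how the aggregation and multiplication steps fit together, the following sketch computes TFIDF weights for a tiny in-memory corpus. It assumes the common definitions (term frequency as the share of a term's occurrences in a document, inverse document frequency as the logarithm of the corpus size divided by the number of documents containing the term) and includes only the first technique from the list, stop-word filtering; the stop-word list and function names are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and"}  # hypothetical stop-word list


def tokenize(text):
    """Split a text into lowercase terms and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf(corpus):
    """corpus: dict doc_id -> text. Returns dict doc_id -> {term: weight}."""
    # Phase 1 (map per document, reduce per (doc, term)): occurrence counts.
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in corpus.items()}
    # Phase 2 (reduce per term): number of documents containing each term.
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(corpus)
    # Phase 3 (map per (doc, term)): multiply tf by idf.
    weights = {}
    for doc_id, c in counts.items():
        total = sum(c.values())
        weights[doc_id] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        }
    return weights
```

In a real MapReduce or Pig implementation each phase would be a separate grouping step; here plain dictionaries stand in for the shuffled intermediate data.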

Similarity Function. Having each document represented by the vector $W_d$ of
term weights $w_{t,d}$, one can use many well-known functions (e.g. inner product,
cosine similarity) to measure the similarity between a pair of vectors. In our
implementation, we follow the ideas from [22], but provide a more generic
mechanism to deploy any similarity function that implements the one-method
interface specified by us, i.e.
similarity(id($d_i$), id($d_j$), sort($w_{t_i,d_i}$), sort($w_{t_j,d_j}$)),
where id($d_i$) denotes a document id and sort($w_{t_i,d_i}$) is the list of
weights of common terms ordered lexicographically by term. The similarity
function receives only terms that have non-zero weights in both vectors; thus,
the final score is calculated faster. This assumption remains valid only for the
pairs of documents that have at least one common term.
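
A pluggable similarity function of this shape could look as follows. This is a sketch under stated assumptions rather than the CoAnSys interface itself: the function names are hypothetical, and the cosine variant assumes that full vector norms are precomputed and looked up by document id, since they cannot be derived from the common-term weights alone.

```python
import math


def common_term_weights(w_i, w_j):
    """Restrict two {term: weight} vectors to shared terms, ordered by term."""
    common = sorted(set(w_i) & set(w_j))
    return [w_i[t] for t in common], [w_j[t] for t in common]


def inner_product(id_i, id_j, weights_i, weights_j):
    """Similarity function that needs only the common-term weights."""
    return sum(a * b for a, b in zip(weights_i, weights_j))


def make_cosine(norms):
    """Build a cosine similarity using precomputed norms keyed by document id."""
    def cosine(id_i, id_j, weights_i, weights_j):
        denom = norms[id_i] * norms[id_j]
        if denom == 0:
            return 0.0
        return inner_product(id_i, id_j, weights_i, weights_j) / denom
    return cosine


def pairwise_similarity(vectors, similarity):
    """Apply `similarity` to every pair of documents sharing at least one term."""
    ids = sorted(vectors)
    result = {}
    for pos, id_i in enumerate(ids):
        for id_j in ids[pos + 1:]:
            w_i, w_j = common_term_weights(vectors[id_i], vectors[id_j])
            if w_i:  # pairs without a common term are skipped entirely
                result[(id_i, id_j)] = similarity(id_i, id_j, w_i, w_j)
    return result
```

Because every similarity function has the same four-argument signature, `pairwise_similarity` can be called with `inner_product`, with a function returned by `make_cosine`, or with any other implementation without further changes.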

3.4 Document Similarity - Implementation Details 

Language Choice. Although both algorithms, TFIDF and vector similarity,
can be easily expressed in the MapReduce model, they require multiple map
and reduce passes, which leads to verbose code. In order to make the
code easier to maintain and less time-consuming to implement, the CeON team uses
Apache Pig (enhanced by UDFs, User Defined Functions, written in Java).

Apache Pig provides a high-level language (called PigLatin) for expressing 
data analysis programs. PigLatin supports many traditional data operations 
(e.g. group by, join, sort, filter, union, distinct). These operations are highly 
beneficial in multiple places such as stop words filtering, self-joining TFIDF's 
output relations or grouping relations with a given condition. 

Input Dataset. The document similarity module takes advantage of the rich
metadata associated with each document. Keywords needed to compute the
similarity are extracted from the title, the abstract and the content of a
publication and are then combined with the keyword list stored in the metadata.
The sections in which a given keyword appears are taken into account
during the computation of the final weights $w_{t,d}$ in the TFIDF algorithm. A
user may configure how important a given section is in the final score.



Additional Knowledge. The main output of the document similarity module is
a set of triples of the form ($d_i$, $d_j$, $sim_{ij}$), where $d_i$ and $d_j$
are documents and $sim_{ij}$ denotes the similarity between them. However,
during the execution of this module, an additional output is generated. It
contains potentially useful information such as:

— top N terms with the highest frequencies that might be considered as addi- 
tional stop words 

— top N terms with the highest importance to a given document 

— top N articles with the lowest and highest number of distinct words. 

3.5 Document Classification - General Description 

Besides its natural application in a basic, non-personalized recommendation
system, document similarity may also be used for document classification based
on the k-nearest neighbors algorithm.

In the context of document classification, it is important to distinguish two
tasks: model creation (MC) and classification code assignment (CCA),
both of which start in the same way, as depicted in Fig. 4. In the first step,
documents are split into two groups, classified and unclassified (in the case of
MC both groups contain the same documents). Then, TFIDF is calculated for both
of these groups. Finally, document similarity between the groups is calculated
(excluding self-similarity) and, for each document from the unclassified group,
the n closest neighbors are retained. After this initial phase, the subsequent
step is different for MC and CCA.

For MC, the classification codes from the neighbors of a document are extracted
and counted. Then, for each classification code, the best threshold is selected
against given criteria (e.g. accuracy or precision), so that the codes assigned
to the "unclassified" documents best fit their known classification codes.
Finally, the pairs (classification code, threshold) are persisted.

In the case of CCA, after extraction of the classification codes, the number of
occurrences of each classification code among the neighbors is compared with
the threshold for that code from the model, and the code is retained if the
count is greater than or equal to the threshold.
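
The code assignment step can be sketched as a simple counting routine. The function name, the data layout and the example codes below are hypothetical; the sketch assumes that the neighbors have already been selected and that the model is a mapping from classification code to threshold.

```python
from collections import Counter


def assign_codes(neighbor_codes, thresholds):
    """Assign classification codes to one unclassified document.

    neighbor_codes: list of code lists, one per retained neighbor.
    thresholds: model mapping classification code -> minimal number of
        occurrences among the neighbors (codes absent from the model
        are never assigned).
    """
    counts = Counter(code for codes in neighbor_codes for code in codes)
    return sorted(
        code
        for code, n in counts.items()
        if code in thresholds and n >= thresholds[code]
    )


neighbors = [["code_a", "code_b"], ["code_a"], ["code_c", "code_b"]]
model = {"code_a": 2, "code_b": 3, "code_c": 1}
print(assign_codes(neighbors, model))  # ['code_a', 'code_c']
```

Here `code_a` occurs twice and meets its threshold of 2, `code_c` occurs once and meets its threshold of 1, while `code_b` occurs twice but needs three occurrences and is therefore dropped.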

3.6 Document Classification - Implementation Details 

Sequence of Operations. The amount of data transferred between
Hadoop nodes has a great influence on the performance of the whole workflow.
Therefore, the operations depicted in Fig. 4 should be considered in two
dimensions. The first one is the TFIDF calculation, for which only the
documents' metadata is needed; subsequently, only the document ID and its TFIDF
weights are needed. The second dimension refers to splitting the data into
"unclassified" and "classified" subsets, or into folds for n-fold
cross-validation. Because the division operations can be collapsed into one, it
is important to perform all of them first, followed by the document similarity
calculations, and not to place the TFIDF calculation in between.
Fig. 4. The initial phase for Model Creation (MC) and Classification Code
Assignment (CCA). At first, documents are split into "classified" and
"unclassified" groups (for MC both groups contain the same metadata) and the
TFIDF measure is calculated over the whole set. Then, cosine similarity is
calculated between documents in each group and the n most similar "classified"
documents are retained.



Language Choice. For data scientists who implement and enhance
algorithms in MapReduce, it is crucial to take advantage of programming
languages created specifically for MapReduce. Again, Apache Pig emerges as the
natural candidate for document classification. Besides its strengths (predefined
functions, UDFs), it should be noted that Pig (like the MapReduce paradigm
itself) lacks some general-purpose instructions such as loops or conditional
statements. However, it is easy to wrap Pig scripts in workflow management tools
such as Apache Oozie, or simply to use the Bash shell, which offers such
operations. Moreover, thanks to the macro and import statements, one can shorten
scripts by extracting common transformations into macros and placing them in
separate files. In this approach, a variant of an operation (e.g.
a way of calculating document similarity) can be passed to a general script as a
parameter used in the import statement.

To optimize memory utilization and improve calculation speed, it is important
to use specialized variants of a general operation. In the case of the join
operation, there are dedicated variants for joining a small dataset with
a large one ("replicated join"), joining data with a skewed key distribution
("skewed join") and joining sorted data ("merge join").
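
The idea behind a replicated join can be shown with a minimal sketch: the small relation is loaded into an in-memory hash table on every worker, and the large relation is streamed past it, so no shuffle of the large relation is needed. The Python function below only illustrates this principle and is not Pig code; in Pig Latin this variant is requested with the `USING 'replicated'` clause of `JOIN`. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def replicated_join(large, small):
    """Map-side hash join of two iterables of (key, value) pairs.

    `small` must fit in memory; `large` is consumed lazily, one record
    at a time, as a mapper would read its input split.
    """
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    for key, value in large:
        for other in lookup.get(key, ()):
            yield key, value, other


terms = [("d1", "hadoop"), ("d2", "pig"), ("d1", "scala")]  # large relation
labels = [("d1", "classified")]                             # small relation
print(list(replicated_join(terms, labels)))
# [('d1', 'hadoop', 'classified'), ('d1', 'scala', 'classified')]
```

The trade-off is memory: the approach works only while the small relation fits on each worker, which is why the other join variants exist.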

Data Storage. The most commonly used ways of storing data in the Apache Hadoop
ecosystem are HBase, the Hadoop database, and HDFS, the Hadoop file system. When
massive data calculations are considered, HDFS is the better choice.
When many computing units try to connect to HBase, not all
of them may be served before the timeout expires. This results in a chain of
failures (tasks assigned to computing units are passed from failed units to
working ones, which become more and more overwhelmed by the amount of data to
process). On the other hand, such a failure cannot happen when HDFS is used.

Using HDFS in MapReduce jobs requires pre-packing data into SequenceFiles,
which store data in the form of (Key, Value) pairs. To obtain the most
generic form, it is recommended to store both the key and the value as
BytesWritable objects, where the value contains data serialized with Protocol
Buffers. This approach makes it easy to store data of any kind and to extend its
schema. Our experience is that reading and writing (BytesWritable,
BytesWritable) pairs, unlike in Java and Scala, causes some complications in
Apache Pig v. 0.9.2. In that case, one may consider encapsulating BytesWritable
in the NullableTuple class.

Workflow Management. As mentioned previously, one of the best ways to build
a chain of data transformations is to employ a workflow manager or a
general-purpose language. Our experience with Apache Oozie and Bash was
strongly in favour of the former. Apache Oozie is a mature solution, well
established in the Apache Hadoop ecosystem, designed for defining and executing
workflows (triggered by a user, a time event or data arrival). In fact, using
Bash or Python would require implementing Oozie-like functionality,
e.g. for the persistence of an execution history.

4 Summary and Future Work 

In this article we have described the experience gained in the implementation
of the CoAnSys framework. The decisions we took in the development process
required about half a year of trial and error. It is hard to find coherent
studies of different algorithms' implementations, and we therefore hope that
this contribution can save the time of people and institutions preparing to
adopt the MapReduce paradigm, and especially the Apache Hadoop ecosystem, in
data mining systems.

This description is a snapshot of ongoing work; hence, many more improvements
and observations are expected in the future.